Community Detection via Facility Location
In this paper we apply theoretical and practical results from facility location theory to the
problem of community detection in networks. The result is an algorithm that computes bounds on a
minimization variant of local modularity. We also define the concept of an edge support and a
new measure of the goodness of community structures with respect to this concept. We present
preliminary results and note that our methods are massively parallelizable.
I. INTRODUCTION

In this paper, we apply results from facility location 
theory to community detection. Leveraging recent de- 
velopments in both fields, we compute a weighting of 
the input graph that represents pertinent information 
for community detection algorithms. We show how to 
compute this weighting efficiently using techniques from 
facility location theory. We can interpret the weights as 
probabilities and randomly sample over a space of good 
community assignments. Computing the weights involves 
solving a linear program (LP) that has special structure.
Solvers for this special kind of LP require only
linear space and linear time per iteration. Furthermore, 
this solution strategy is amenable to massive parallelism. 

We also give new measures for evaluating the quality 
of community assignments and show that our algorithms 
provide a provable lower bound on solution quality with 
respect to one of these. We demonstrate empirically that 
another of our measures is complementary to modularity, 
and that optimizing based on this new measure better 
resolves small communities in large graphs and better 
matches common sense community structures in famil- 
iar datasets. Thus, we make four contributions in this 
work: we demonstrate a connection between community 
detection and facility location; we use that connection 
to compute lower bounds on solution quality; we show 
how to compute new measures for the goodness of com- 
munity structure that contrast with modularity; and we 
apply massively parallelizable methods to compute these 
bounds and measures. 



II. BACKGROUND 

Newman and Girvan's concept of modularity is now
ubiquitous in the community detection literature. There
are several variations on this concept, and many heuristics
to optimize the original concept and these variations,
e.g., [13]. In order to compute community structures with
good modularity in large network instances, researchers
commonly use one of two approaches: greedy heuristics, and
metaheuristic approaches such as simulated annealing.
Agarwal and Kempe [1] applied mathematical programming to
the problem of maximizing modularity, resulting in an
algorithm to compute upper bounds for that measure.

We present an alternative that applies results from the
vast facility location literature to community detection.
We model a variation of modularity as an uncapacitated
facility location problem (to be defined below), and
employ the simple and powerful Volume algorithm [2] to
solve the problem. Mulvey and Crowder used similar
techniques, applying older subgradient methods, to solve
p-median problems that approximately cluster points in
n-dimensional space.

We first observe that specializing a minimization ver- 
sion of the modularity problem produces an uncapaci- 
tated facility location problem. We then discuss its solu- 
tions and the interpretation and use of its results. 



III. STRONGLY-LOCAL MODULARITY (SLM) 

Girvan and Newman define the modularity (Q) for a
graph G as follows: $Q = \sum_s (e_{ss} - a_s^2)$, where s is a
community in the domain {1, ..., g}, $e_{rs}$ is the fraction of
E(G) (the edge set of the graph) that connects a node
in community r to one in community s, and $a_s$ is the
fraction of edges that have at least one endpoint in s
($a_s = \sum_r e_{rs}$). Squaring $a_s$ gives the probability that
an edge would have both endpoints in community s in
a random graph with the same endpoint degree distribution.
Modularity is a way to measure the quality of a
community assignment: it rewards communities that are
better connected than would be expected in a random
graph reflecting the endpoint degree distribution.
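
The definition above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `modularity`, the edge-list input format, and the toy graph are assumptions, and $a_s$ is computed with the usual convention that it equals the fraction of edge endpoints lying in community s.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum_s (e_ss - a_s^2) of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each vertex to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)   # edges with both endpoints in community s
    endpoints = defaultdict(int)  # edge endpoints that fall in community s
    for u, v in edges:
        endpoints[community[u]] += 1
        endpoints[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    q = 0.0
    for s, ends in endpoints.items():
        e_ss = internal[s] / m
        a_s = ends / (2 * m)
        q += e_ss - a_s ** 2
    return q


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {v: "A" for v in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in one community
```

On this toy graph the two-triangle assignment scores 5/14, while lumping all vertices together scores zero, which matches the intuition that modularity rewards groups that are denser than chance.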

Now consider a simple variation of the modularity
concept: $Q^- = \sum_s (1 - (e_{ss} - a_s^2))$. Minimizing $Q^-$ is similar
to, though not identical to, maximizing Q. Basic algebra
shows that a community assignment minimizing $Q^-$ has
at most as many communities as one that maximizes Q,
and this is typically a strict inequality.
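
The "basic algebra" can be spelled out as follows. This is our reading of the argument, not a derivation taken from the text: we write $Q^-$ for the minimization variant, and the symbols $P^*$, $P^-$, $g^*$, and $g^-$ (a modularity-maximizing assignment, an assignment minimizing $Q^-$, and their respective community counts) are names introduced here for illustration.

```latex
\begin{align*}
Q^- &= \sum_{s=1}^{g} \bigl(1 - (e_{ss} - a_s^2)\bigr) = g - Q, \\
g^- - Q(P^-) &\le g^* - Q(P^*)
  && \text{since } P^- \text{ minimizes } Q^-, \\
g^- &\le g^* - \bigl(Q(P^*) - Q(P^-)\bigr) \le g^*
  && \text{since } P^* \text{ maximizes } Q.
\end{align*}
```

The last inequality is strict whenever the minimizer of $Q^-$ does not itself attain the maximum modularity.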

It is well-known that community assignments of maximum
modularity fail to resolve small communities in
large graphs [8]. Reflecting on this work, it would seem
that the $Q^-$ measure will compound this problem by
resolving even fewer communities. However, we provide a
remedy via a further modification described below, and
our switching of optimization sense will prove useful.

Muff, Rao, and Caflisch [13] define the local modularity
to be the same as modularity, except that the denominators
in the fractions $e_{rs}$ are the numbers of edges in
a cluster's "neighborhood," defined to be itself and all
neighboring clusters. We use a metric that also focuses
on local structure, but is even more restrictive, requiring
no information about the structure of neighboring
communities. We define a strongly local community to
consist of a single representative node and all of its
immediate neighbors, i.e., a full community of radius one.
Let $Q_s = e_{ss} - a_s^2$. We can compute this measure for any
strongly local community without knowing any community
assignments other than the vertices in s. Ignoring
algorithmic details, we need only know the number of
triangles in the strongly local community and the degree
of each node.
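
To illustrate why triangle counts and degrees suffice, the following sketch computes $Q_s$ for the closed neighborhood of a single vertex. It is a hedged example, not the authors' code: the function name `strongly_local_q`, the adjacency-set representation, and the toy graph are assumptions, and $a_s$ again uses the endpoint-fraction convention (degree sum divided by twice the edge count).

```python
def strongly_local_q(adj, v, m):
    """Q_s = e_ss - a_s^2 for the community made of v and all its neighbors.

    adj: dict mapping each vertex to the set of its neighbors (simple graph).
    m: total number of edges in the graph.
    """
    nbrs = adj[v]
    # Each triangle through v contributes one edge between two neighbors of v;
    # the sum below sees every such edge from both of its endpoints.
    triangles = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    internal_edges = len(nbrs) + triangles
    degree_sum = len(nbrs) + sum(len(adj[u]) for u in nbrs)
    e_ss = internal_edges / m
    a_s = degree_sum / (2 * m)
    return e_ss - a_s ** 2


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for e in edges for v in e}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

q_corner = strongly_local_q(adj, 0, len(edges))  # community {0, 1, 2}
q_bridge = strongly_local_q(adj, 2, len(edges))  # community {0, 1, 2, 3}
```

Here the corner vertex 0 yields 5/28 and the bridge endpoint 2 yields 3/49, so the neighborhood that stays inside one triangle scores higher than the one that reaches across the bridge.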

Now we give the key definition that allows us to model 
the problem using facility location theory. Let 

/=) J if s is a strongly local community 
^ 1 otherwise 

We define the Strongly-Local Modularity (SLM) as follows:

s 

We use SLM in combination with a relaxed notion of 
community assignment in which community representa- 
tives can share common neighbors within their respective 
communities. 

IV. MODELING SLM AS A FACILITY 
LOCATION PROBLEM 

We transform instances of the community detection
problem into instances of the Uncapacitated Facility
Location Problem (UFLP). Given a set of potential facility
locations L, a set of customers C, a set of facility
opening costs $f_i$, and a set of service costs $c_{ij}$ (the cost to
serve customer j using facility i), the objective function
of UFLP is

$F(x) = \sum_{i \in L} f_i x_i + \sum_{i \in L, j \in C} c_{ij} y_{ij}$,

where the variables $x_i$ indicate whether or not location i
hosts a facility, and the variables $y_{ij}$ indicate whether or
not location i serves customer j. Solutions to UFLP
minimize F(x) subject to the constraints that every customer
must be served, and that no customer can be served by
a facility that does not exist. UFLP is a well known
NP-hard problem [11], yet it has special structure that
enables efficient computations of fractional solutions.

We consider all vertices to be potential facility
locations, with facility opening costs $f_s = (1 - Q_s)$. Each
vertex is a customer that must be served by a facility
(and may serve itself if it hosts a facility). The service
cost is zero for a node to serve a neighbor in the graph.
Nodes cannot serve non-neighbors (cost is effectively
infinite). The solution to the UFLP is a minimum-cost
facility and service assignment in which every vertex is
served.
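
The model just described can be made concrete on a tiny instance. The sketch below is illustrative only: the helper names are ours, the opening costs are the values $1 - Q_s$ worked out by hand for a hypothetical two-triangle graph (assuming the endpoint-fraction convention for $a_s$), and exhaustive search stands in for a real solver purely to show what the integer optimum looks like.

```python
import math
from itertools import combinations


def service_cost(adj, i, j):
    """Cost for facility i to serve customer j: free for i itself or a neighbor."""
    return 0.0 if j == i or j in adj[i] else math.inf


def uflp_cost(adj, f, open_facilities):
    """UFLP objective when every customer uses its cheapest open facility."""
    total = sum(f[i] for i in open_facilities)
    for j in adj:
        total += min(service_cost(adj, i, j) for i in open_facilities)
    return total


def brute_force_uflp(adj, f):
    """Exhaustive search over facility sets; only viable for tiny graphs."""
    best_cost, best_set = math.inf, None
    vertices = sorted(adj)
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            cost = uflp_cost(adj, f, subset)
            if cost < best_cost:
                best_cost, best_set = cost, set(subset)
    return best_cost, best_set


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for e in edges for v in e}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Opening costs f_s = 1 - Q_s, computed by hand for this graph.
f = {0: 23 / 28, 1: 23 / 28, 2: 46 / 49, 3: 46 / 49, 4: 23 / 28, 5: 23 / 28}

best_cost, best_set = brute_force_uflp(adj, f)
```

Any facility set with finite cost is a dominating set, because a vertex with no open facility in its closed neighborhood incurs infinite service cost. On this instance the search opens one corner vertex in each triangle.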

UFLP is an integer program (IP), but we need only
solve the linear programming relaxation of the IP [12].
This relaxation has special structure that obviates the
need for a general linear program solver. We apply
Lagrangian relaxation in conjunction with an elegant
subgradient method known as the Volume algorithm (VA)
in the Lagrangian relaxation framework of [3]. The memory
usage of this combined procedure is on the order of
the problem input size. VA makes a series of linear-time
passes over the data. There are no known asymptotic
bounds on the number of iterations. However, in practice,
the total runtime is comparable to the $O(n \log^2 n)$
runtime of the most familiar fast modularity heuristic,
the CNM greedy algorithm. We have observed this
experimentally on graphs with up to 100 million edges.

The Volume algorithm provides a fractional solution to
the UFLP that in turn provides a provable lower bound
on the minimization variant of modularity when all
communities are strongly local.
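
The source of the lower bound can be illustrated with a deliberately simplified stand-in. The sketch below relaxes the "every customer is served" constraints with multipliers u and runs plain subgradient ascent with a diminishing step; it is not the Volume algorithm and not the authors' code. The function names, step rule, iteration count, and toy instance are all assumptions. Each pass touches every vertex and its neighbors once, which is the linear-time access pattern described above.

```python
import math


def lagrangian_bound(adj, f, u):
    """Evaluate the Lagrangian L(u) and one subgradient.

    With zero service cost inside closed neighborhoods, facility i is worth
    opening only if the multipliers it collects exceed its opening cost.
    """
    bound = sum(u.values())
    served = {j: 0 for j in adj}  # number of opened facilities serving j
    for i in adj:
        takers = [j for j in adj[i] | {i} if u[j] > 0]
        reduced = f[i] - sum(u[j] for j in takers)
        if reduced < 0:
            bound += reduced
            for j in takers:
                served[j] += 1
    subgrad = {j: 1 - served[j] for j in adj}
    return bound, subgrad


def subgradient_ascent(adj, f, iterations=200, step0=0.5):
    """Best Lagrangian bound found by plain subgradient ascent."""
    u = {j: 0.0 for j in adj}
    best = -math.inf
    for t in range(1, iterations + 1):
        bound, g = lagrangian_bound(adj, f, u)
        best = max(best, bound)
        for j in adj:
            u[j] += (step0 / t) * g[j]
    return best


# Hypothetical toy graph and hand-computed costs f_s = 1 - Q_s.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for e in edges for v in e}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
f = {0: 23 / 28, 1: 23 / 28, 2: 46 / 49, 3: 46 / 49, 4: 23 / 28, 5: 23 / 28}

lower_bound = subgradient_ascent(adj, f)
```

Every value of L(u) is a valid lower bound on the integer optimum, so the best value seen can never exceed the cost of the best dominating set of leaders (23/14 for this toy instance).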




FIG. 1: The support of Zachary's karate club. Solid edges 
have stronger support than speckled edges and larger vertices 
are more likely to be leaders. Note the nearly-invisible edges 
linking portions of the club destined to split. 

Our community-assignment procedure selects a set of 
facilities to "open." Each open facility represents a leader 
of a subset of a strongly-local community. That is, every 
community has at least one node that is adjacent to all 
other nodes in the community. The set of leaders, there- 
fore, forms a dominating set, that is, a set of vertices 
D such that each vertex in the graph is either in D or 
adjacent to an element of D. 

FIG. 2: The support of Zachary's karate club and its relationship to actual solutions of various algorithms. The support
variance $\mathrm{Var}_s$ decreases as solutions agree more closely with the support. Note that the community assignments with maximum
modularity split edges with strong support within both of the true communities.

In our community-finding procedure, called SNL, we
set the facility-opening costs as described above and use
VA to compute an optimal fractional placement of
facilities. We then open each facility with probability equal to
its fractional assignment value. If this does not produce
a dominating set, then we repair it to make a dominating
set. We then assign all the other vertices to a
community. There are a number of ways one can do this. In this
paper, we assign each non-selected node to the selected
neighbor with highest fractional facility placement.
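
The rounding steps can be sketched as follows. The text does not specify how a non-dominating set is repaired, so the repair rule here (promote the highest-weight vertex in the closed neighborhood of each undominated vertex) is an assumption, as are the function name `snl_round` and the fractional values `x`, which are a hypothetical feasible solution rather than output of VA.

```python
import random


def snl_round(adj, x, rng):
    """Round fractional facility values x into leaders and communities."""
    # Open each facility independently with probability x[v].
    leaders = {v for v in adj if rng.random() < x[v]}
    # Repair (assumed rule): any vertex with no leader in its closed
    # neighborhood promotes the highest-weight vertex of that neighborhood.
    for v in sorted(adj):
        if v not in leaders and not (adj[v] & leaders):
            leaders.add(max(adj[v] | {v}, key=lambda w: (x[w], w)))
    # Leaders represent themselves; every other vertex joins the open
    # neighbor with the highest fractional facility value.
    community = {}
    for v in adj:
        if v in leaders:
            community[v] = v
        else:
            community[v] = max(adj[v] & leaders, key=lambda w: (x[w], w))
    return leaders, community


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for e in edges for v in e}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Hypothetical fractional solution: each closed neighborhood sums to 1.
x = {0: 0.5, 1: 0.5, 2: 0.0, 3: 0.0, 4: 0.5, 5: 0.5}

leaders, community = snl_round(adj, x, random.Random(0))
```

Because the leader set only grows during repair, a vertex that is dominated at any point stays dominated, so the final set is always a dominating set.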

V. THE SUPPORT 

We define the support of an edge (u, v) to be a real
number between 0 and 1 that indicates the level of
support/evidence for nodes u and v being in the same
community. Given any randomized algorithm A for community
detection, such as the metaheuristic approach of [13],
we can compute a support with respect to A by sampling:
generate many community assignments using A,
then compute the fraction of times each edge is
intra-community. We now show how to compute a support
with respect to SNL without sampling.

Given a fractional solution x to an instance of UFLP,
we define the support with respect to SNL to be a set
of values z, where $z_j$ is the probability that, in a set of
community leaders sampled from x, edge j could link
two vertices in the same community. Formally,

$z_{e=(v,w)} = 1 - \bigl[(1 - x_v)(1 - x_w) \prod_{u \in N(v) \cap N(w)} (1 - x_u)\bigr]$.

An edge e = (v, w) has strong support if it is unlikely
that none of the vertices capable of serving both v and w
will become a server. This includes v, w, and their
mutual neighbors. Figure 1 depicts the support of Zachary's
karate club dataset [10], an abstraction of a social
network that famously split into two. The larger vertices
and darker edges have higher x and z values,
respectively. Even before community assignments have been
specified, the community structure begins to emerge in
fractional form. Note that some edges that are destined
to become inter-community edges have very low support
and are therefore almost invisible.
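
As a minimal sketch of the support computation, the function below evaluates, for each edge (v, w), one minus the probability that none of v, w, and their common neighbors is opened when facilities are opened independently with probabilities x. The function name `edge_support`, the adjacency-set input, and the fractional values `x` are assumptions made for illustration.

```python
def edge_support(adj, x):
    """Support z of every edge, keyed by (v, w) with v < w."""
    z = {}
    for v in adj:
        for w in adj[v]:
            if v < w:
                # Probability that no vertex able to serve both v and w opens.
                none_open = (1 - x[v]) * (1 - x[w])
                for u in adj[v] & adj[w]:
                    none_open *= 1 - x[u]
                z[(v, w)] = 1 - none_open
    return z


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for e in edges for v in e}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Hypothetical fractional facility values.
x = {0: 0.5, 1: 0.5, 2: 0.0, 3: 0.0, 4: 0.5, 5: 0.5}

z = edge_support(adj, x)
```

With these values every triangle edge receives support 0.75, while the bridge edge, whose endpoints have no common neighbor and zero facility weight, receives support 0. This mirrors the nearly invisible inter-community edges in Figure 1.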

Given the support of a graph, we define a new measure
to evaluate the effectiveness of community assignments.
We define the support variance ($\mathrm{Var}_s$) as follows, assuming
that $\delta(v,w)$ is an indicator function with value 1 if v
and w are in the same community and 0 otherwise.

$\mathrm{Var}_s = \sum_{(v,w) \in E(G)} (\delta(v,w) - z_{(v,w)})^2$.
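
A short sketch of the support variance follows, under the assumption that the measure is the unnormalized sum over edges of the squared difference between the same-community indicator and the edge support. The function name `support_variance` and the support values are hypothetical; the values correspond to a two-triangle toy graph whose bridge edge (2, 3) has zero support.

```python
def support_variance(z, community):
    """Sum over edges of (indicator(same community) - support)^2."""
    total = 0.0
    for (v, w), z_vw in z.items():
        delta = 1.0 if community[v] == community[w] else 0.0
        total += (delta - z_vw) ** 2
    return total


# Hypothetical supports for two triangles joined by the bridge edge (2, 3).
z = {
    (0, 1): 0.75, (0, 2): 0.75, (1, 2): 0.75,
    (3, 4): 0.75, (3, 5): 0.75, (4, 5): 0.75,
    (2, 3): 0.0,
}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {v: "A" for v in range(6)}

var_split = support_variance(z, split)    # cuts only the unsupported bridge
var_merged = support_variance(z, merged)  # keeps the bridge inside a community
```

Cutting only the unsupported bridge gives 0.375, whereas merging everything gives 1.375, so the assignment that agrees with the support has the lower variance.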



VI. PRELIMINARY COMPUTATIONAL 
RESULTS 

We applied our methods to several familiar datasets.
Figure 1 shows the support of Zachary's karate club. The
colored images in Figure 2 depict the solutions of four
algorithms: our facility location-based rounding
heuristic (SNL); the CNM greedy algorithm; a combination of
these two (SNL-CNM), in which SNL is used to compute
strongly-local communities, then CNM is allowed
to merge these; and the eigenvector-based approach of
Newman, augmented with a Kernighan-Lin-like
postprocessing step (Newman-KL). Newman-KL gives one
of the best known values for modularity.

In this case, intuition and history favor the facility
location-based community assignments with low support
variance over those with high modularity. For example,
the latter split the topmost community despite
reasonably strong support for the edges holding it together.

FIG. 3: Maximizing modularity on these instances is known to produce non-intuitive answers. However, each instance has a
support that agrees with common sense and leads to intuitive rounded solutions. The left-hand instance is taken from the
literature, and the right-hand instance is the ring-of-cliques example from [8]. As more cliques are added to the ring, modularity
optimization will merge cliques, increasing the support variance. The facility location-based solution is not sensitive to the number of cliques.



Figure 3 shows two instances that have been
demonstrated in recent literature to present inherent problems
for modularity algorithms. The modularity of the
left-hand instance tricks greedy algorithms into
merging the endpoints of the edge that has the least
support in their first step. The right-hand instance, from [8],
has been used to show that modularity optimization fails
to resolve small communities in large graphs. The
example shown is a ring of ten 5-cliques, and grouping the
5-cliques individually both minimizes support variance
and maximizes modularity. However, as the number of
5-cliques increases, the common sense solution continues to
minimize support variance, but is discarded by
modularity optimization methods in favor of larger communities.
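
The ring-of-cliques behavior can be checked with exact arithmetic. The sketch below is a worked example under stated assumptions: adjacent cliques are joined by a single edge, the helper name `ring_modularity` is ours, $a_s$ is the fraction of edge endpoints in a community, and only two candidate assignments are compared (one community per clique versus one community per pair of adjacent cliques).

```python
from fractions import Fraction


def ring_modularity(n_cliques, clique_size=5, group=1):
    """Modularity when `group` consecutive cliques form each community.

    Assumes n_cliques is divisible by group and group < n_cliques.
    """
    clique_edges = clique_size * (clique_size - 1) // 2
    m = n_cliques * (clique_edges + 1)  # clique edges plus one ring edge each
    communities = n_cliques // group
    internal = group * clique_edges + (group - 1)  # ring edges inside a group
    degree_sum = group * (2 * clique_edges + 2)
    e_ss = Fraction(internal, m)
    a_s = Fraction(degree_sum, 2 * m)
    return communities * (e_ss - a_s ** 2)


single_10 = ring_modularity(10, group=1)  # ten cliques, one community each
paired_10 = ring_modularity(10, group=2)  # ten cliques, merged in pairs
single_30 = ring_modularity(30, group=1)
paired_30 = ring_modularity(30, group=2)
```

With ten 5-cliques the individual grouping has the higher modularity (89/110 versus 83/110). Under these assumptions the two assignments tie at 22 cliques, and for 30 cliques the paired grouping scores higher, which is the failure to resolve small communities described above.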

VII. CONCLUSIONS 

We have applied models and algorithms from facility
location theory to the problem of community detection,
yielding an algorithm to compute a provable lower bound
on a minimization variant of local modularity, a support
measure that can be computed without sampling, and a
randomized rounding heuristic that can be generalized
into a class of heuristics. We have also introduced a new
measure for evaluating the quality of community
structures. The effectiveness of our heuristics for large graphs
remains open, but the solution techniques themselves are
scalable and based upon simple traversals of the network
that are massively parallelizable in a more natural way
than the priority queue-based methods previously
published. We will explore the scalability of our methods on
supercomputers in future work.